1 - Getting Started with R

This R Markdown file introduces some R basics and shows how to produce the various plots from R that you find in Section 1 of the course notes. Before working through this R Markdown file you will need to install the package car from CRAN and also load the Regression.RData workspace file that I sent you via e-mail.

Example - West Bearskin Lake Smallmouth Bass

The file wblake.txt (http://course1.winona.edu/bdeppa/Regression/Data/wblake.txt) in the Dataset section of the course website contains 3 columns of information giving the age, length, and scale radius for a sample of $n=439$ smallmouth bass. We will read these data into R using the command read.csv, which will be the primary way we read all subsequent course datasets into R. The read.csv command reads the comma-delimited data into an R object called a data frame. After reading a data frame into R, it is always a good idea to make sure it was read in correctly and that the data types are what you expected. View(data frame) displays the data frame as a spreadsheet, str gives you information about the data types, head displays the first few rows of the data frame, and summary gives you summary statistics for each variable appropriate for its data type.

require(car)

## Loading required package: car

Bass = read.csv(file="http://course1.winona.edu/bdeppa/Regression/Data/wblake.txt")
View(Bass)
names(Bass)

## [1] "Age"    "Length" "Scale"

str(Bass)

## 'data.frame':    439 obs. of  3 variables:
##  $ Age   : int  1 1 1 1 1 1 2 4 5 5 ...
##  $ Length: int  71 64 57 68 72 80 108 154 180 180 ...
##  $ Scale : num  1.91 1.88 1.1 1.33 1.59 ...

head(Bass)

##   Age Length   Scale
## 1   1     71 1.90606
## 2   1     64 1.87707
## 3   1     57 1.09736
## 4   1     68 1.33108
## 5   1     72 1.59283
## 6   1     80 1.91602

summary(Bass)

##       Age            Length          Scale       
##  Min.   :1.000   Min.   : 55.0   Min.   : 1.054  
##  1st Qu.:2.500   1st Qu.:138.5   1st Qu.: 3.571  
##  Median :5.000   Median :194.0   Median : 5.786  
##  Mean   :4.203   Mean   :193.0   Mean   : 5.864  
##  3rd Qu.:6.000   3rd Qu.:252.0   3rd Qu.: 8.018  
##  Max.   :8.000   Max.   :362.0   Max.   :14.710

hist(Bass$Length,main="Length of Smallmouth Bass",xlab="Length (mm)")

Here we have read the wblake.txt file into a data frame called Bass. The names(data frame name) command shows the names of the variables/columns in the data frame. We can refer to an individual variable by typing the name of the data frame followed by a $ and the name of the variable, as we did with the Length column when creating the histogram above, i.e. Bass$Length.
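As a small self-contained illustration of these inspection commands and of the $ notation, the sketch below rebuilds the six rows printed by head(Bass) above from inline text, so it runs without access to the course website. The object name bass_head is made up for this example and is not part of the course files.

```r
# Rebuild the six rows shown by head(Bass) from inline comma-delimited text.
# bass_head is a hypothetical name used only for this illustration.
bass_head <- read.csv(text = "Age,Length,Scale
1,71,1.90606
1,64,1.87707
1,57,1.09736
1,68,1.33108
1,72,1.59283
1,80,1.91602")

str(bass_head)      # Age and Length are read as integer, Scale as numeric
names(bass_head)    # column names

# Three equivalent ways to pull out the Length column
len_dollar  <- bass_head$Length
len_bracket <- bass_head[["Length"]]
len_with    <- with(bass_head, Length)

mean(len_dollar)    # mean length of these six fish only
```

The $ form is the one used throughout this file; the other two are shown only to make clear that $ is simply selecting a column by name.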

Statplot is a function that I wrote to provide a series of univariate displays for a continuous variable. It returns four displays: a histogram with a kernel density estimate, a boxplot, a symmetry plot, and a normal quantile plot. The normal quantile plot also gives the p-value from the Shapiro-Wilk test for normality. If the reported p-value is less than 0.05, we reject the null hypothesis that the variable being considered is normally distributed (Gaussian). This function is in the Regression.RData workspace file I sent you. An example of its output is shown on page 11 of the notes. The function takes a single numeric variable as its argument, and you can use the xname="variable name" option to include the variable name in the plot titles. In the code chunk below we also produce the histogram on pg. 10 of the notes and the empirical CDF found on pg. 12.

Statplot(Bass$Length,xname="Length")
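If you do not have the Regression.RData file loaded, the four displays described above can be roughly approximated with base R. The sketch below is not the Statplot function itself: the name four_panel_sketch, the panel layout, and the titles are all assumptions made for illustration, and the data are simulated rather than the bass lengths.

```r
# A rough base-R approximation of a four-panel univariate display.
# This is NOT the course's Statplot function; four_panel_sketch is a
# hypothetical name and the layout choices here are assumptions.
four_panel_sketch <- function(x, xname = "x") {
  x <- sort(x[!is.na(x)])
  old_par <- par(mfrow = c(2, 2))
  on.exit(par(old_par))

  # 1. Histogram on the density scale with a kernel density estimate
  hist(x, probability = TRUE, main = paste("Histogram of", xname), xlab = xname)
  lines(density(x))

  # 2. Boxplot
  boxplot(x, horizontal = TRUE, main = paste("Boxplot of", xname), xlab = xname)

  # 3. Symmetry plot: distance above the median vs. distance below it
  k <- floor(length(x) / 2)
  below <- median(x) - x[seq_len(k)]
  above <- rev(x)[seq_len(k)] - median(x)
  plot(below, above, main = paste("Symmetry Plot of", xname),
       xlab = "Distance below median", ylab = "Distance above median")
  abline(0, 1)

  # 4. Normal quantile plot with the Shapiro-Wilk p-value in the title
  sw <- shapiro.test(x)
  qqnorm(x, main = paste0("Normal Q-Q Plot of ", xname,
                          " (Shapiro-Wilk p = ", signif(sw$p.value, 3), ")"))
  qqline(x)

  invisible(sw$p.value)
}

# Simulated stand-in data (not the bass lengths), so the sketch is self-contained
set.seed(1)
sim_length <- rnorm(200, mean = 190, sd = 60)
```

Calling four_panel_sketch(sim_length, xname = "Length") draws the four panels and invisibly returns the Shapiro-Wilk p-value. In the symmetry plot, points falling near the 45-degree line suggest a roughly symmetric distribution.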

hist(Bass$Length,prob=T,ylim=c(0,.006),main="Lengths of Smallmouth Bass")
lines(density(Bass$Length))
rug(Bass$Length)

plot(ecdf(Bass$Length),xlab="Length (mm)",ylab="F(x) = P(X <= x)",main="Empirical CDF for Length")
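The object returned by ecdf is itself a function: evaluating it at x gives the proportion of observations less than or equal to x. A minimal sketch using only the six lengths printed by head(Bass) above (the names six_lengths and Fn are made up for this example):

```r
# The six Length values shown by head(Bass); used here only for illustration
six_lengths <- c(71, 64, 57, 68, 72, 80)

Fn <- ecdf(six_lengths)   # Fn is a step function

Fn(68)                    # proportion of values <= 68, here 0.5
mean(six_lengths <= 68)   # same quantity computed directly
mean(six_lengths < 68)    # strict inequality gives a smaller value at a data point
```

Three of the six values (57, 64, 68) are at or below 68, so the empirical CDF jumps to 0.5 exactly at 68.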

On pages 19 and 21 of the notes I use R to find p-values and quantiles from a t-distribution. Those commands are demonstrated below.

pt(-2.2052,df=438,lower.tail=T)  # p-value from pg. 19

## [1] 0.01397947
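Because the t-distribution is symmetric about zero, the lower-tail probability computed above can be obtained in several equivalent ways, and doubling it gives the corresponding two-sided p-value. The sketch below only restates the pt call above; the variable names are made up for illustration.

```r
t_stat <- -2.2052   # test statistic used in the pt() call above
dof    <- 438       # degrees of freedom

p_lower <- pt(t_stat, df = dof, lower.tail = TRUE)        # P(T <= -2.2052)
p_upper <- pt(abs(t_stat), df = dof, lower.tail = FALSE)  # P(T > 2.2052), equal by symmetry
p_two   <- 2 * pt(-abs(t_stat), df = dof)                 # two-sided p-value

c(lower = p_lower, upper = p_upper, two_sided = p_two)
```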

#
# QUANTILES FOR CONFIDENCE INTERVALS
#
qt(.025,df=438) # For 95% CI

## [1] -1.965395

qt(.975,df=438) # For 95% CI

## [1] 1.965395

qt(.005,df=438) # For 99% CI

## [1] -2.5871

qt(.995,df=438) # For 99% CI

## [1] 2.5871

qt(.05,df=438) # For 90% CI

## [1] -1.64834

qt(.95,df=438) # For 90% CI

## [1] 1.64834
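Each pair of quantiles above comes from the same rule: a 100(1 - alpha)% interval uses the alpha/2 and 1 - alpha/2 quantiles. The sketch below wraps that rule in a small helper (t_crit, a name made up here) and then checks a hand-built interval against t.test, using only the six lengths printed by head(Bass) so that it runs on its own. The resulting interval describes those six fish, not the full sample of 439.

```r
# Critical value for a two-sided confidence interval at the given level.
# t_crit is a hypothetical helper, not part of the course files.
t_crit <- function(conf_level, df) {
  alpha <- 1 - conf_level
  qt(1 - alpha / 2, df = df)
}

t_crit(0.95, df = 438)   # matches qt(.975, df=438) above
t_crit(0.99, df = 438)   # matches qt(.995, df=438) above
t_crit(0.90, df = 438)   # matches qt(.95, df=438) above

# Hand-built 95% CI for a mean, using the six lengths from head(Bass)
six_lengths <- c(71, 64, 57, 68, 72, 80)
n  <- length(six_lengths)
se <- sd(six_lengths) / sqrt(n)
ci_manual <- mean(six_lengths) + c(-1, 1) * t_crit(0.95, df = n - 1) * se

# The same interval from t.test, for comparison
ci_ttest <- as.numeric(t.test(six_lengths, conf.level = 0.95)$conf.int)
all.equal(ci_manual, ci_ttest)   # the two intervals agree
```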

Whenever we use R in this course, I will provide R Markdown files like this one to help with R usage.

Brant Deppa - Professor, Statistics & Data Science - STAT 360 Regression Analysis

August 29th, 2018

Cheers!